The Penn Parsed Corpus of Modern British English: First Parsing Results and Analysis
Abstract
This paper presents the first results on parsing the Penn Parsed Corpus of Modern British English (PPCMBE), a million-word historical treebank with an annotation style similar to that of the Penn Treebank (PTB). We describe key features of the PPCMBE annotation style that differ from the PTB, and present some experiments with tree transformations to better compare the results to the PTB. First steps in parser analysis focus on problematic structures created by the parser.
Similar resources
The Icelandic Parsed Historical Corpus (IcePaHC)
We describe the background for and building of IcePaHC, a one-million-word parsed historical corpus of Icelandic which has just been finished. This corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern I...
C-structures and F-structures for the British National Corpus
We describe how the British National Corpus (BNC), a one hundred million word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures, using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, ...
Creating a Dual-Purpose Treebank
We describe the background for and building of IcePaHC, a one-million-word parsed historical corpus of Icelandic which has just been finished. This corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern I...
Improving Dependency Parsing with Subtrees from Auto-Parsed Data
This paper presents a simple and effective approach to improve dependency parsing by using subtrees from auto-parsed data. First, we use a baseline parser to parse large-scale unannotated data. Then we extract subtrees from dependency parse trees in the auto-parsed data. Finally, we construct new subtree-based features for parsing algorithms. To demonstrate the effectiveness of our proposed app...
The HeliPaD: a parsed corpus of Old Saxon
This short note introduces the HeliPaD, a new parsed corpus of Old Saxon (Old Low German). It is annotated according to the standards of the Penn Corpora of Historical English, enriched with lemmatization and additional morphological attributes as well as textual and metrical annotation. This note provides an overview of its main features and compares it to existing resources such as the Deutsc...